Draft analysis¶


Group name: Gruppe A


Introduction¶

This section includes an introduction to the project motivation, the data, and the research question. It also includes a data dictionary.

In the last 30 years, the way people date has changed and has become increasingly difficult: the willingness to date has decreased, dating is expensive and time-consuming, we have too many (perceived) options, and we too easily accept negative gender stereotypes. In the 19th century, a custom in the United States called New Year's Calling meant that on New Year's Day many young, single women would hold an open house (a party or reception during which a person's home is open to visitors) on 1 January, inviting eligible bachelors, both friends and strangers, to stop by for a brief (no more than 10–15-minute) visit. The term SpeedDating was later established as a registered trademark by Aish HaTorah, which began hosting SpeedDating events in 1998.

Roughly ten years later, Fisman et al. ran a speed dating study and collected about 8,000 observations over a two-year period, published in the paper Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment. Because speed dating has become more and more popular in recent years, and because the Corona pandemic has given rise to a completely new approach to dating, we want to investigate what drives matches in speed dating. With the data from this study, we want to answer the following research question:

  • What are the most effective characteristics to achieve a match in opposite sex speed dating?

To answer our research question, we defined the following sub-questions to strengthen our main research question:

  • Do specific characteristics affect the match selection of the survey participants?
  • Do these specific characteristics occur in both sexes?
  • What type of persons participate in speed dating events?
  • After three weeks, how many contacts did each type of person have?
  • Is there a significant difference between the number of men calling women or women calling men after three weeks?

The following hypotheses support our research question:

Null hypothesis:

  • Having specific characteristics has no effect on the match selection of the survey participants
  • There is no correlation between shared interests, attributes and getting a match

Hypotheses:

  • Survey participants who share the characteristic samerace and have opposite genders tend to achieve more matches
  • Survey participants with a higher income tend to achieve more matches than survey participants with a lower income
  • Achieving matches because of sharing specific characteristics occurs in both sexes
  • Three weeks after the event, men called women more often than women called men

Data dictionary¶

General information¶

Name Description Role Type Format
iid Unique subject number (wave + id + gender) ID numeric int
id Subject number within wave ID numeric int
gender Gender of the person. Female = 0, Male = 1 predictor nominal category
idg Subject number within gender (id + gender) ID numeric int
condtn Condition of the wave, 1 = Limited choice, 2 = extensive choice predictor nominal category
wave ID of the event ID numeric int
round Number of people that met in wave predictor numeric int
position Station number where met partner predictor numeric int
positin1 Station number where started predictor numeric int
order The number of date that night when met partner predictor numeric int
partner Partner's ID number the night of event ID numeric int
pid Partner's IID number ID numeric int
match 1 = yes, 0 = no response nominal category
int_corr Correlation between participant's and partner's ratings of interests in Time 1 predictor numeric float
samerace Participant and the partner were the same race. 1 = yes, 0 = no predictor nominal category
age_o Age of partner predictor numeric int
race_o Race of partner predictor nominal category
pf_o_att Partner's stated preference for attractiveness at Time 1. The sum of all pf_o_ elements must be 100. predictor numeric float
pf_o_sin Partner's stated preference for sincerity at Time 1. The sum of all pf_o_ elements must be 100. predictor numeric float
pf_o_int Partner's stated preference for intelligence at Time 1. The sum of all pf_o_ elements must be 100. predictor numeric float
pf_o_fun Partner's stated preference for fun at Time 1. The sum of all pf_o_ elements must be 100. predictor numeric float
pf_o_amb Partner's stated preference for ambition at Time 1. The sum of all pf_o_ elements must be 100. predictor numeric float
pf_o_sha Partner's stated preference for shared interests at Time 1. The sum of all pf_o_ elements must be 100. predictor numeric float
dec_o Decision of partner the night of event predictor nominal category
attr_o Attractive. Rating by partner the night of the event from 1 (awful) to 10 (great) predictor numeric int
sinc_o Sincere. Rating by partner the night of the event from 1 (awful) to 10 (great) predictor numeric int
intel_o Intelligent. Rating by partner the night of the event from 1 (awful) to 10 (great) predictor numeric int
fun_o Fun. Rating by partner the night of the event from 1 (awful) to 10 (great) predictor numeric int
amb_o Ambitious. Rating by partner the night of the event from 1 (awful) to 10 (great) predictor numeric int
shar_o Shared Interests/Hobbies. Rating by partner the night of the event from 1 (awful) to 10 (great) predictor numeric int
like_o Overall, how much do you like this person? 1 (don't like at all) to 10 (like a lot) predictor numeric int
prob_o How probable do you think it is that this person will say 'yes' for you? 1 (not probable) to 10 (extremely probable) predictor numeric int
met_o Have you met this person before? (1 = yes, 2 = no) predictor ordinal category

Time 1: Survey filled out by students that are interested in participating in order to register for the event¶

Name Description Role Type Format
age Age of the person predictor numeric int
field Field of study predictor nominal string
field_cd Field of study coded.
1= Law
2= Math
3= Social Science, Psychologist
4= Medical Science, Pharmaceuticals, and Bio Tech
5= Engineering
6= English/Creative Writing/ Journalism
7= History/Religion/Philosophy
8= Business/Econ/Finance
9= Education, Academia
10= Biological Sciences/Chemistry/Physics
11= Social Work
12= Undergrad/undecided
13=Political Science/International Affairs
14=Film
15=Fine Arts/Arts Administration
16=Languages
17=Architecture
18=Other
predictor nominal category
mn_sat Median SAT score for the undergraduate institution where attended. Proxy for intelligence.
tuition Tuition listed for each response to undergrad
race Race of the attendee
1 = Black/African American
2 = European/Caucasian-American
3 = Latino/Hispanic American
4 = Asian/Pacific Islander/Asian-American
5 = Native American
6 = Other
predictor nominal category
imprace How important is it that a person you date be of the same racial/ethnic background? (1 - 10) predictor numeric int
imprelig How important is it that a person you date be of the same religious background? (1 - 10) predictor numeric int
from Where the person is originally from predictor nominal string
zipcode Zip code of the area where the participant grew up predictor nominal category
income Median household income based on zipcode predictor numeric float
goal What is the goal in participating in this event?
1 = Seemed like a fun night out
2 = To meet new people
3 = To get a date
4 = Looking for a serious relationship
5 = To say I did it
6 = Other
predictor nominal category
date How frequently do you go on dates?
1 = Several times a week
2 = Twice a week
3 = Once a week
4 = Twice a month
5 = Once a month
6 = Several times a year
7 = Almost never
predictor ordinal category
go_out How often do you go out (not necessarily on dates)?
1 = Several times a week
2 = Twice a week
3 = Once a week
4 = Twice a month
5 = Once a month
6 = Several times a year
7 = Almost never
predictor ordinal category
career What is your intended career? predictor nominal string
career_c Career coded.
1 = Lawyer
2 = Academic/Research
3 = Psychologist
4 = Doctor/Medicine
5 =Engineer
6 = Creative Arts/Entertainment
7 = Banking/Consulting/Finance/Marketing/Business/CEO/Entrepreneur/Admin
8 = Real Estate
9 = International/Humanitarian Affairs
10 = Undecided
11 = Social Work
12 = Speech Pathology
13 = Politics
14 = Pro sports/Athletics
15 = Other
16 = Journalism
17 = Architecture
predictor nominal category
sports Playing sports/athletics. Interest in this Hobby from 1 - 10. predictor numeric int
tvsports Watching sports. Interest in this Hobby from 1 - 10. predictor numeric int
excersice Body building/exercising. Interest in this Hobby from 1 - 10. predictor numeric int
dining Dining out. Interest in this Hobby from 1 - 10. predictor numeric int
museums Museums/galleries. Interest in this Hobby from 1 - 10. predictor numeric int
art Art. Interest in this Hobby from 1 - 10. predictor numeric int
hiking Hiking/camping. Interest in this Hobby from 1 - 10. predictor numeric int
gaming Gaming. Interest in this Hobby from 1 - 10. predictor numeric int
clubbing Dancing/clubbing. Interest in this Hobby from 1 - 10. predictor numeric int
reading Reading. Interest in this Hobby from 1 - 10. predictor numeric int
tv Watching TV. Interest in this Hobby from 1 - 10. predictor numeric int
theater Theater. Interest in this Hobby from 1 - 10. predictor numeric int
movies Movies. Interest in this Hobby from 1 - 10. predictor numeric int
concerts Going to concerts. Interest in this Hobby from 1 - 10. predictor numeric int
music Music. Interest in this Hobby from 1 - 10. predictor numeric int
shopping Shopping. Interest in this Hobby from 1 - 10. predictor numeric int
yoga Yoga/meditation. Interest in this Hobby from 1 - 10. predictor numeric int
exhappy Overall, how happy do you expect to be with the people you meet during the event? (1 - 10) predictor numeric int
expnum Out of 20 people, how many do you expect will be interested in dating you? predictor numeric int
attr1_1 What do you (personally) look for in the opposite sex. The sum of all attr1_1 elements must be 100. predictor numeric float
sinc1_1 What do you (personally) look for in the opposite sex. The sum of all attr1_1 elements must be 100. predictor numeric float
intel1_1 What do you (personally) look for in the opposite sex. The sum of all attr1_1 elements must be 100. predictor numeric float
fun1_1 What do you (personally) look for in the opposite sex. The sum of all attr1_1 elements must be 100. predictor numeric float
amb1_1 What do you (personally) look for in the opposite sex. The sum of all attr1_1 elements must be 100. predictor numeric float
shar1_1 What do you (personally) look for in the opposite sex. The sum of all attr1_1 elements must be 100. predictor numeric float
attr4_1 What do you think your fellow men/women look for in the opposite sex. The sum of all attr4_1 elements must be 100. predictor numeric float
sinc4_1 What do you think your fellow men/women look for in the opposite sex. The sum of all attr4_1 elements must be 100. predictor numeric float
intel4_1 What do you think your fellow men/women look for in the opposite sex. The sum of all attr4_1 elements must be 100. predictor numeric float
fun4_1 What do you think your fellow men/women look for in the opposite sex. The sum of all attr4_1 elements must be 100. predictor numeric float
amb4_1 What do you think your fellow men/women look for in the opposite sex. The sum of all attr4_1 elements must be 100. predictor numeric float
shar4_1 What do you think your fellow men/women look for in the opposite sex. The sum of all attr4_1 elements must be 100. predictor numeric float
attr2_1 What do you think the opposite sex looks for in a date. The sum of all attr2_1 elements must be 100. predictor numeric float
sinc2_1 What do you think the opposite sex looks for in a date. The sum of all attr2_1 elements must be 100. predictor numeric float
intel2_1 What do you think the opposite sex looks for in a date. The sum of all attr2_1 elements must be 100. predictor numeric float
fun2_1 What do you think the opposite sex looks for in a date. The sum of all attr2_1 elements must be 100. predictor numeric float
amb2_1 What do you think the opposite sex looks for in a date. The sum of all attr2_1 elements must be 100. predictor numeric float
shar2_1 What do you think the opposite sex looks for in a date. The sum of all attr2_1 elements must be 100. predictor numeric float
attr3_1 Rate yourself from 1 - 10. predictor numeric int
sinc3_1 Rate yourself from 1 - 10. predictor numeric int
intel3_1 Rate yourself from 1 - 10. predictor numeric int
fun3_1 Rate yourself from 1 - 10. predictor numeric int
amb3_1 Rate yourself from 1 - 10. predictor numeric int
shar3_1 Rate yourself from 1 - 10. predictor numeric int
attr5_1 How do you think others perceive you? 1 = awful, 10 = great predictor numeric int
sinc5_1 How do you think others perceive you? 1 = awful, 10 = great predictor numeric int
intel5_1 How do you think others perceive you? 1 = awful, 10 = great predictor numeric int
fun5_1 How do you think others perceive you? 1 = awful, 10 = great predictor numeric int
amb5_1 How do you think others perceive you? 1 = awful, 10 = great predictor numeric int
shar5_1 How do you think others perceive you? 1 = awful, 10 = great predictor numeric int

Careful: For all *_1, *_2 and *_4 attributes, waves 6-9 rated the importance of the attributes in a potential date on a scale of 1-10 (1 = not at all important, 10 = extremely important).
Waves 1-5 and 10-21 distributed 100 points among the attributes; the total points must equal 100.
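
If one wanted to compare these preference columns across all waves, the 1-10 ratings from waves 6-9 could be rescaled so that each participant's six values sum to 100, as in the other waves. A minimal sketch (assuming a data frame df with the pf_o_* and wave columns loaded; we do not apply this in the analysis below):

import pandas as pd

pf_cols = ["pf_o_att", "pf_o_sin", "pf_o_int", "pf_o_fun", "pf_o_amb", "pf_o_sha"]

def rescale_to_100(frame: pd.DataFrame) -> pd.DataFrame:
    # Waves 6-9 rated each attribute on a 1-10 scale instead of distributing 100 points.
    mask = frame["wave"].between(6, 9)
    row_sums = frame.loc[mask, pf_cols].sum(axis=1)
    # Divide each row by its own sum and scale to 100 points.
    frame.loc[mask, pf_cols] = frame.loc[mask, pf_cols].div(row_sums, axis=0) * 100
    return frame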

Round 2: Filled out by subjects after each "date" during the event.¶

Name Description Role Type Format
dec Decision if you want to see the person again (1) or not (0) predictor nominal category
attr Rating of the attribute for this person from 1 - 10. predictor numeric int
sinc Rating of the attribute for this person from 1 - 10. predictor numeric int
intel Rating of the attribute for this person from 1 - 10. predictor numeric int
fun Rating of the attribute for this person from 1 - 10. predictor numeric int
amb Rating of the attribute for this person from 1 - 10. predictor numeric int
shar Rating of the attribute for this person from 1 - 10. predictor numeric int
like Overall, how much do you like this person? 1 (don't like at all) to 10 (like a lot) predictor numeric int
prob How probable do you think it is that this person will say 'yes' for you? 1 (not probable) to 10 (extremely probable) predictor numeric int
met Have you met this person before? (1 = yes, 2 = no) predictor ordinal category

Half way through meeting all potential dates during the night of the event on their scorecard¶

Name Description Role Type Format
attr1_s What do you (personally) look for in the opposite sex. 1 - 10 rating. predictor numeric int
sinc1_s What do you (personally) look for in the opposite sex. 1 - 10 rating. predictor numeric int
intel1_s What do you (personally) look for in the opposite sex. 1 - 10 rating. predictor numeric int
fun1_s What do you (personally) look for in the opposite sex. 1 - 10 rating. predictor numeric int
amb1_s What do you (personally) look for in the opposite sex. 1 - 10 rating. predictor numeric int
shar1_s What do you (personally) look for in the opposite sex. 1 - 10 rating. predictor numeric int
attr4_s Rate yourself from 1 - 10 predictor numeric int
sinc4_s Rate yourself from 1 - 10 predictor numeric int
intel4_s Rate yourself from 1 - 10 predictor numeric int
fun4_s Rate yourself from 1 - 10 predictor numeric int
amb4_s Rate yourself from 1 - 10 predictor numeric int

Time 2: Survey is filled out the day after participating in the event. Subjects must have submitted this in order to be sent their matches.¶

Name Description Role Type Format
satis_2 Overall, how satisfied were you with the people you met? (1=not at all satisfied, 10=extremely satisfied) predictor numeric int
length Four minutes is:
1 = Too little,
2 = Too much
3 = Just Right
predictor nominal category
numdat_2 The number of Speed "Dates" you had was:
1 = Too few,
2 = Too many,
3 = Just right
predictor nominal category

... and again the same questions regarding attributes

Time 3: Subjects filled out 3-4 weeks after they had been sent their matches¶

Name Description Role Type Format
you_call How many have you contacted to set up a date? predictor numeric int
them_cal How many have contacted you? predictor numeric int
date_3 Have you been on a date with any of your matches? Yes=1 No=2 predictor nominal category
numdat_3 If yes, how many of your matches have you been on a date with so far? predictor numeric int
num_in_3 If yes, how many? predictor numeric int

... and again the same questions regarding attributes


  • Role: response, predictor, ID (ID columns are not used in a model but can help to better understand the data)

  • Type: nominal, ordinal or numeric

  • Format: int, float, string, category, date or object

Descriptive terms for our used variables¶

Name Description Descriptive term
calls Sum of the calls a participant made ("you_call") and received ("them_cal") with the other party Calls of participants
attr Rating of the attribute for this person from 1 - 10. Attractivity of speed dating participant
sinc Rating of the attribute for this person from 1 - 10. Sincerety of speed dating participant
intel Rating of the attribute for this person from 1 - 10. Intelligence of speed dating participant
fun Rating of the attribute for this person from 1 - 10. Humor of speed dating participant
amb Rating of the attribute for this person from 1 - 10. Ambition of speed dating participant
shar Rating of the attribute for this person from 1 - 10. Shared Interests/Hobbies of the speed dating participant to the other party
like Overall, how much do you like this person? 1 (don't like at all) to 10 (like a lot) Strength of liking of the speed dating participant toward the other party
prob How probable do you think it is that this person will say 'yes' for you? 1 (not probable) to 10 (extremely probable) Probability that the speed dating participant likes the other party
met Have you met this person before? (1 = yes, 2 = no) Meeting indicator of participants
gender Gender of the person. Female = 0, Male = 1 Gender of speed dating participant
order The number of date that night when met partner Order of date of speed dating participant and the other party during event
match 1 = yes, 0 = no Match of the speed dating participant and the other party
int_corr Correlation between participant's and partner's ratings of interests in Time 1 Correlation of the speed dating participant and the other party
samerace Participant and the partner were the same race. 1 = yes, 0 = no Indicates, if the speed dating participant and the other party have the same race
age Age of the person Age of speed dating participant
age_o Age of partner Age of other party
race Race of the attendee
1 = Black/African American
2 = European/Caucasian-American
3 = Latino/Hispanic American
4 = Asian/Pacific Islander/Asian-American
5 = Native American
6 = Other
Race of speed dating participant
race_o Race of partner Race of other party
imprace How important is it that a person you date be of the same racial/ethnic background? (1 - 10) Importance of the other party having the same race as the speed dating participant
intel_o Intelligent. Rating by partner the night of the event from 1 (awful) to 10 (great) Intelligence of the other party
sinc_o Sincere. Rating by partner the night of the event from 1 (awful) to 10 (great) Sincerety of the other party
like_o Overall, how much do you like this person? 1 (don't like at all) to 10 (like a lot) Strength of liking toward the other party
prob_o How probable do you think it is that this person will say 'yes' for you? 1 (not probable) to 10 (extremely probable) Probability that the other party likes the speed dating participant
fun_o Fun. Rating by partner the night of the event from 1 (awful) to 10 (great) Humor of the other party
satis_2 Overall, how satisfied were you with the people you met? (1 = not at all satisfied, 10 = extremely satisfied) Satisfaction with the people met at the event
amb_o Ambitious. Rating by partner the night of the event from 1 (awful) to 10 (great) Ambition of the other party
shar_o Shared Interests/Hobbies. Rating by partner the night of the event from 1 (awful) to 10 (great) Shared Interests/Hobbies of the other party to speed dating participant
attr_o Attractive. Rating by partner the night of the event from 1 (awful) to 10 (great) Attractivity of the other party
met_o Have you met this person before? (1 = yes, 2 = no) Meeting indicator of the other party
exphappy Overall, on a scale of 1-10, how happy do you expect to be with the people you meet during the speed-dating event? Expected Happiness of meeting people
pid partner's iid number partner's iid number

Setup¶

In this section we import the libraries and functions needed to create our data frames, calculations and visualizations.

In [ ]:
%matplotlib inline

import pickle

import pandas as pd
import altair as alt
import numpy as np
import seaborn as sns

from sklearn.linear_model import LogisticRegressionCV

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.metrics import ConfusionMatrixDisplay, classification_report, RocCurveDisplay, roc_auc_score, make_scorer, precision_recall_curve, PrecisionRecallDisplay

import matplotlib.pyplot as plt
alt.data_transformers.disable_max_rows()
Out[ ]:
DataTransformerRegistry.enable('default')

Data¶

Import data¶

Data was taken from here: https://perso.telecom-paristech.fr/eagan/class/igr204/datasets/SpeedDating.csv

We create our data frame out of the imported csv data from our data source

In [ ]:
df = pd.read_csv("../data/interim/TransformedData",delimiter=",", index_col=0)

We have 195 attributes in this dataset, which is a lot. We also already see some NaN values that we will have to deal with later.

In [ ]:
df.head()
Out[ ]:
iid id gender idg condtn wave round position positin1 order ... attr3_3 sinc3_3 intel3_3 fun3_3 amb3_3 attr5_3 sinc5_3 intel5_3 fun5_3 amb5_3
0 1 1.0 0 1 1 1 10 7 NaN 4 ... 5.0 7.0 7.0 7.0 7.0 NaN NaN NaN NaN NaN
1 1 1.0 0 1 1 1 10 7 NaN 3 ... 5.0 7.0 7.0 7.0 7.0 NaN NaN NaN NaN NaN
2 1 1.0 0 1 1 1 10 7 NaN 10 ... 5.0 7.0 7.0 7.0 7.0 NaN NaN NaN NaN NaN
3 1 1.0 0 1 1 1 10 7 NaN 5 ... 5.0 7.0 7.0 7.0 7.0 NaN NaN NaN NaN NaN
4 1 1.0 0 1 1 1 10 7 NaN 7 ... 5.0 7.0 7.0 7.0 7.0 NaN NaN NaN NaN NaN

5 rows × 195 columns

Data structure¶

Currently there are 175 float columns, 13 int columns and 7 object columns. We have already identified some variables that should rather be categorical.

In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8378 entries, 0 to 8377
Columns: 195 entries, iid to amb5_3
dtypes: float64(175), int64(13), object(7)
memory usage: 12.5+ MB

Data corrections¶

We create lists for each data type so we can cast the attributes more easily.

In [ ]:
cat_vars = [
    "gender", 
    "condtn",
    "match",
    "samerace",
    "race_o",
    "dec_o",
    "met_o",
    "field_cd",
    "race",
    "zipcode",
    "goal",
    "date",
    "go_out",
    "career_c",
    "dec",
    "met",
    "length",
    "numdat_2",
    "date_3",
]

float_vars = [
    "int_corr",
    "pf_o_att",
    "pf_o_sin",
    "pf_o_int",
    "pf_o_fun",
    "pf_o_amb",
    "pf_o_sha",
    "income",
    "attr1_1",
    "sinc1_1",
    "intel1_1",
    "fun1_1",
    "amb1_1",
    "shar1_1",
    "attr4_1",
    "sinc4_1",
    "intel4_1",
    "fun4_1",
    "amb4_1",
    "shar4_1",
    "attr2_1",
    "sinc2_1",
    "intel2_1",
    "fun2_1",
    "amb2_1",
    "shar2_1"
]

int_vars = [
    "attr_o",
    "sinc_o",
    "intel_o",
    "fun_o",
    "amb_o",
    "shar_o",
    "like_o",
    "prob_o",
    "age",
    "age_o",
    "imprace",
    "imprelig",
    "sports",
    "tvsports",
    "excersice",
    "dining",
    "museums",
    "art",
    "hiking",
    "gaming",
    "clubbing",
    "reading",
    "tv",
    "theater",
    "movies",
    "concerts",
    "music",
    "shopping",
    "yoga",
    "exhappy",
    "attr3_1",
    "sinc3_1",
    "intel3_1",
    "fun3_1",
    "amb3_1",
    "attr5_1",
    "sinc5_1",
    "intel5_1",
    "fun5_1",
    "amb5_1",
    "attr",
    "sinc",
    "intel",
    "fun",
    "amb",
    "shar",
    "like",
    "prob",
    "attr1_s",
    "sinc1_s",
    "intel1_s",
    "fun1_s",
    "amb1_s",
    "shar1_s",
    "attr4_s",
    "sinc4_s",
    "intel4_s",
    "fun4_s",
    "amb4_s",
    "satis_2",
    "iid",
    "id",
    "idg",
    "wave",
    "round",
    "order",
    "partner",
    "pid",
    "expnum",
    "you_call",
    "them_cal",
    "numdat_3",
    "num_in_3",
    "position",
    "positin1",
]

str_vars = [
    "field",
    "from",
    "career"
]

unused_vars = [
    "undergrd",
    "mn_sat",
    "tuition"
]

Thanks to these lists we can cast all attributes in one go.

In [ ]:
df[cat_vars]=df[cat_vars].astype("category",copy=False)
df[float_vars]=df[float_vars].astype("float",copy=False)
df[str_vars]=df[str_vars].astype("str",copy=False)
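
Note that the int_vars list is not cast above because several of those columns still contain NaN values, which a plain int cast cannot hold. If an integer dtype were desired, pandas' nullable Int64 type would be one option (a sketch, assuming the underlying values are whole numbers; we keep these columns as float here):

# Sketch: nullable integer cast that tolerates the remaining NaN values.
df[int_vars] = df[int_vars].astype("Int64")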

Since we assume that the calls a participant made ("you_call") and the calls a participant received ("them_cal") do not overlap, we sum these two variables into a single calls variable.

In [ ]:
df['calls'] = df['you_call'] + df['them_cal']

We already see that a lot of participants did not call at all (1,284), but on the other hand the majority (2,690) called or was called by at least one person.

In [ ]:
df['calls'].value_counts()
Out[ ]:
0.0     1284
1.0      935
2.0      848
4.0      275
3.0      254
5.0      196
6.0      115
9.0       21
14.0      18
10.0      18
22.0      10
Name: calls, dtype: int64

We can see that men report calling women far more often (2,422 calls) than women report calling men (681 calls).
On the other hand, both sexes reported being called more often than the other sex reported making calls (men: 1,035 received vs. 681 made by women; women: 2,866 received vs. 2,422 made by men), so there may be some bias in these numbers or the data may be incomplete.
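
These figures can be reproduced directly from the data before plotting; a minimal sketch (assuming gender is still coded as 0 = female, 1 = male at this point):

# Sketch: total calls made (you_call) and received (them_cal) per gender.
df.groupby("gender")[["you_call", "them_cal"]].sum()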

In [ ]:
alt.Chart(df).mark_bar().encode(
    alt.X('gender', title='female,male'),
    y='sum(you_call)',
    color='gender',
    tooltip='sum(you_call)'
) | alt.Chart(df).mark_bar().encode(
    alt.X('gender', title='female,male'),
    y='sum(them_cal)',
    color='gender',
    tooltip='sum(them_cal)'
).properties(
    title='Distribution of calls per gender'
)
c:\Users\Sscho\anaconda3\envs\stats\lib\site-packages\altair\utils\core.py:317: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for col_name, dtype in df.dtypes.iteritems():
Out[ ]:

In order to answer our research question, we select the following variables from our variable list and rename them with descriptive terms so they are easier to recognize.

In [ ]:
df = df.rename(columns=
{
    "match": "Match",
    "fun": "Humor",
    "shar": "Shared_interests",
    "attr": "Attractivity",
    "prob": "Probability", 
    "intel": "Intelligence",
    "sinc": "Sincerety",
    "amb": "Ambition",
    "gender": "Gender",
    "samerace": "Same_race",
    "race": "Race",
    "race_o": "Race_opposite",
    "calls": "Calls",
    "age": "Age",
    "age_o": "Age_opposite",
    "imprace": "Importance_same_race",
    "order": "Order",
    "int_corr": "Interests_correlation"
})

variables = [
    'Match',
    'Humor',
    'Shared_interests',
    'Attractivity',
    'Probability',
    'Intelligence',
    'Sincerety',
    'Ambition',
    'Gender',
    'Same_race',
    'Race',
    'Race_opposite',
    'Calls',
    'Age',
    'Age_opposite',
    'Importance_same_race',
    'Order',
    'Interests_correlation'
]

df = df[variables]

Let's have a look at how many NaN values there are in the dataset.

We can see that the variables you_call and them_cal, which we merged earlier into Calls, have a lot of NaN values.
This information was collected after the event, so many people probably did not answer these questions at all.

In [ ]:
g=sns.displot(
    data=df.isna().melt(value_name="NaN"),
    y="variable",
    hue="NaN",
    multiple="fill",
)
g.set_axis_labels("Share", "Variables")
plt.title('NaN share of the observations per variable')
Out[ ]:
Text(0.5, 1.0, 'NaN share of the observations per variable')

Identify and drop NAs¶

We want to identify observations with missing values in our relevant variables in order to remove them from our data frame.
Observations with missing values in the relevant variable columns will be removed.

In [ ]:
df
Out[ ]:
Match Humor Shared_interests Attractivity Probability Intelligence Sincerety Ambition Gender Same_race Race Race_opposite Calls Age Age_opposite Importance_same_race Order Interests_correlation
0 0 7.0 5.0 6.0 6.0 7.0 9.0 6.0 0 0 4.0 2.0 2.0 21.0 27.0 2.0 4 0.14
1 0 8.0 6.0 7.0 5.0 7.0 8.0 5.0 0 0 4.0 2.0 2.0 21.0 22.0 2.0 3 0.54
2 1 8.0 7.0 5.0 NaN 9.0 8.0 5.0 0 1 4.0 4.0 2.0 21.0 22.0 2.0 10 0.16
3 1 7.0 8.0 7.0 6.0 8.0 6.0 6.0 0 0 4.0 2.0 2.0 21.0 23.0 2.0 5 0.61
4 1 7.0 6.0 5.0 6.0 7.0 6.0 6.0 0 0 4.0 3.0 2.0 21.0 24.0 2.0 7 0.21
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8373 0 5.0 NaN 3.0 5.0 5.0 5.0 NaN 1 0 2.0 3.0 2.0 25.0 26.0 1.0 5 0.64
8374 0 4.0 NaN 4.0 4.0 8.0 6.0 4.0 1 0 2.0 6.0 2.0 25.0 24.0 1.0 4 0.71
8375 0 8.0 NaN 4.0 5.0 8.0 7.0 8.0 1 0 2.0 3.0 2.0 25.0 29.0 1.0 10 -0.46
8376 0 4.0 5.0 4.0 5.0 5.0 6.0 NaN 1 0 2.0 4.0 2.0 25.0 22.0 1.0 16 0.62
8377 0 4.0 1.0 3.0 5.0 6.0 7.0 8.0 1 0 2.0 4.0 2.0 25.0 22.0 1.0 15 0.01

8378 rows × 18 columns

In [ ]:
df.isna().sum().sort_values(ascending=False)
Out[ ]:
Calls                    4404
Shared_interests         1067
Ambition                  712
Humor                     350
Probability               309
Intelligence              296
Sincerety                 277
Attractivity              202
Interests_correlation     158
Age_opposite              104
Age                        95
Importance_same_race       79
Race_opposite              73
Race                       63
Order                       0
Match                       0
Gender                      0
Same_race                   0
dtype: int64

For the Calls variable, we assume that NaN means zero calls. This is of course only an approximation.

For all the other attributes we drop the NaN values because the ratio is rather small.

In [ ]:
df['Calls'].fillna(value=0, inplace=True)
df.dropna(inplace=True)
df.reset_index(inplace=True, drop=True)
In [ ]:
g=sns.displot(
    data=df.isna().melt(value_name="NaN"),
    y="variable",
    hue="NaN",
    multiple="fill",
)
g.set_axis_labels("Share", "Variables")
plt.title('NaN share of the observations per variable')
Out[ ]:
Text(0.5, 1.0, 'NaN share of the observations per variable')

Variable lists¶

We define the response variable (y) as Match and our features (X) accordingly.

Because we can run different analyses on categorical and numerical variables, we separate them into different lists.

The features we consider the most important (humor, shared interests etc.) are called personal attributes from now on.

In [ ]:
y_label = 'Match'

cat_features = ['Gender', 'Same_race', 'Race', 'Race_opposite', 'Importance_same_race']
personal_features = ['Humor', 'Shared_interests', 'Attractivity', 'Intelligence', 'Sincerety', 'Ambition']
num_features = ['Probability', 'Calls', 'Age', 'Age_opposite', 'Order', 'Interests_correlation']

features = cat_features + personal_features + num_features

X = df[features]
y = df[y_label]

Data splitting¶

We do a train/test split with 80%/20% of the data.

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    shuffle=True,
                                                    random_state=42)

Analysis¶

Descriptive statistics¶

In [ ]:
df_train = pd.DataFrame(X_train).copy()
df_train[y_label] = pd.DataFrame(y_train)
  • The first thing to notice is that the personal attributes are collected on a scale between 0 and 10.
  • The participants were around the same age, mostly between 24 and 28.
  • The 22 calls seem to be an absolute outlier, as the median is 0 and even the 75% quantile is 1.
  • Humor, intelligence, sincerity and ambition were rated a bit higher than shared interests and attractivity.
  • The probability that the other person will match was rated relatively low (the median of Probability is 5).
In [ ]:
df_train.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
Importance_same_race 5427.0 3.847982 2.847182 0.00 1.00 3.00 6.00 10.00
Humor 5427.0 6.406118 1.940148 0.00 5.00 7.00 8.00 10.00
Shared_interests 5427.0 5.459370 2.133440 0.00 4.00 6.00 7.00 10.00
Attractivity 5427.0 6.201843 1.941594 0.00 5.00 6.00 8.00 10.00
Intelligence 5427.0 7.361618 1.558780 0.00 6.00 7.00 8.00 10.00
Sincerety 5427.0 7.168325 1.747358 0.00 6.00 7.00 8.00 10.00
Ambition 5427.0 6.746545 1.798458 0.00 6.00 7.00 8.00 10.00
Probability 5427.0 5.248940 2.132117 0.00 4.00 5.00 7.00 10.00
Calls 5427.0 0.839506 1.820310 0.00 0.00 0.00 1.00 22.00
Age 5427.0 26.289294 3.537744 18.00 24.00 26.00 28.00 55.00
Age_opposite 5427.0 26.300903 3.526730 18.00 24.00 26.00 28.00 55.00
Order 5427.0 8.846140 5.455792 1.00 4.00 8.00 13.00 22.00
Interests_correlation 5427.0 0.196842 0.303543 -0.83 -0.02 0.21 0.43 0.91

Let's check for differences between men and women.

In [ ]:
c1 = alt.Chart(df_train).mark_boxplot(extent='min-max').encode(
    alt.X('Gender', title='female | male'),  # the x axis should show gender, not the match label
    alt.Y(alt.repeat("column"), type='quantitative'),
    color=alt.Color('Gender')
).properties(
    width=100,
    height=250
).repeat(
    column=personal_features
).properties(title="Gender differences per Attribute")

c1

We can see that for all these attributes it is important to be rated above 6 to get a match, ideally around 8.

In [ ]:
alt.Chart(df).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(y_label),
    alt.Size('count()'),
    alt.Color(y_label),
    tooltip=['count()']
).properties(
    width=250,
    height=150
).repeat(
    column=personal_features
).interactive().properties(title="Attribute importance to get a match")
c:\Users\Sscho\anaconda3\envs\stats\lib\site-packages\altair\utils\core.py:317: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for col_name, dtype in df.dtypes.iteritems():
Out[ ]:

We can see there are a lot more no-matches than matches in our dataset.

In [ ]:
alt.Chart(df).mark_bar().encode(
    alt.X(y_label),
    alt.Y('count()'),
    alt.Color(y_label),
    tooltip = ['count()']
).properties(
    width=250,
    height=150
).interactive().properties(title="Match and no-match comparison")
Out[ ]:

Exploratory data analysis¶

  • First we can see that gender was almost equally distributed
  • Around 2/5 of the pairs were of the same race
  • Most of the people were European/Caucasian-American (race = 2), followed by Asian/Pacific Islander/Asian-American (race = 4)
  • For most of the people, having the same race was not important (imprace = 1)
  • We already saw the characteristics of the personal attributes in the descriptive statistics. While there are people who were rated very low on humor and attractivity (< 5), the ratings for intelligence, sincerity and ambition start at around 5.
In [ ]:
alt.Chart(df_train).mark_bar().encode(
    alt.X(alt.repeat("column"), type="quantitative", bin=True),
    y='count()',
).properties(
    width=150,
    height=150
).repeat(
    column=features
).properties(title="Attribute distribution")
Out[ ]:
In [ ]:
alt.Chart(df_train).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    alt.Color(y_label)
).properties(
    width=150,
    height=150
).repeat(
    row=personal_features,
    column=personal_features
).interactive().properties(title="Attribute correlation")
Out[ ]:
In [ ]:
base = alt.Chart(df_train).mark_area().encode(
            y='density:Q',
            color=alt.Color(y_label)
        )

chart = alt.vconcat(data=df_train)
for var in num_features:
    chart &= base.transform_density(
        density=var,
        groupby=[y_label],
        counts=True,
        as_=[var, 'density']
    ).mark_area().encode(
        x=var
    ).properties(title="Attribute importance for match")

chart
Out[ ]:

Relationships¶

In [ ]:
corr = df_train.corr()
corr.round(2)
C:\Users\Sscho\AppData\Local\Temp\ipykernel_23984\2882414194.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  corr = df_train.corr()
Out[ ]:
Importance_same_race Humor Shared_interests Attractivity Intelligence Sincerety Ambition Probability Calls Age Age_opposite Order Interests_correlation
Importance_same_race 1.00 -0.03 -0.05 -0.05 0.01 0.05 -0.01 -0.05 -0.07 -0.15 -0.01 0.04 -0.09
Humor -0.03 1.00 0.62 0.59 0.50 0.51 0.50 0.39 0.06 -0.01 -0.04 -0.04 0.02
Shared_interests -0.05 0.62 1.00 0.49 0.40 0.40 0.44 0.48 0.07 0.01 -0.00 0.01 0.04
Attractivity -0.05 0.59 0.49 1.00 0.40 0.41 0.37 0.30 0.05 0.01 -0.05 -0.03 0.01
Intelligence 0.01 0.50 0.40 0.40 1.00 0.67 0.63 0.29 0.08 -0.02 0.02 -0.09 0.05
Sincerety 0.05 0.51 0.40 0.41 0.67 1.00 0.47 0.35 0.07 -0.02 -0.00 -0.12 0.02
Ambition -0.01 0.50 0.44 0.37 0.63 0.47 1.00 0.29 0.04 -0.04 0.01 -0.06 0.03
Probability -0.05 0.39 0.48 0.30 0.29 0.35 0.29 1.00 0.01 -0.03 -0.00 -0.08 0.01
Calls -0.07 0.06 0.07 0.05 0.08 0.07 0.04 0.01 1.00 0.07 0.04 0.02 0.04
Age -0.15 -0.01 0.01 0.01 -0.02 -0.02 -0.04 -0.03 0.07 1.00 0.11 0.01 0.06
Age_opposite -0.01 -0.04 -0.00 -0.05 0.02 -0.00 0.01 -0.00 0.04 0.11 1.00 -0.00 0.09
Order 0.04 -0.04 0.01 -0.03 -0.09 -0.12 -0.06 -0.08 0.02 0.01 -0.00 1.00 0.01
Interests_correlation -0.09 0.02 0.04 0.01 0.05 0.02 0.03 0.01 0.04 0.06 0.09 0.01 1.00
In [ ]:
# take a look at all correlations
df[y_label]=df[y_label].astype("int",copy=False)
corr = df.corr()
df[y_label]=df[y_label].astype("category",copy=False)

corr[y_label].sort_values(ascending=False)
C:\Users\Sscho\AppData\Local\Temp\ipykernel_23984\3048780704.py:3: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  corr = df.corr()
Out[ ]:
Match                    1.000000
Humor                    0.276822
Shared_interests         0.270565
Attractivity             0.263631
Probability              0.259991
Intelligence             0.173844
Sincerety                0.167161
Ambition                 0.145196
Calls                    0.128427
Interests_correlation    0.029220
Age_opposite            -0.029092
Order                   -0.029916
Age                     -0.035454
Importance_same_race    -0.055506
Name: Match, dtype: float64

Looks like those personal attributes are pretty important for a match.

In [ ]:
corr.style.background_gradient(cmap='Blues')
Out[ ]:
  Match Humor Shared_interests Attractivity Probability Intelligence Sincerety Ambition Calls Age Age_opposite Importance_same_race Order Interests_correlation
Match 1.000000 0.276822 0.270565 0.263631 0.259991 0.173844 0.167161 0.145196 0.128427 -0.035454 -0.029092 -0.055506 -0.029916 0.029220
Humor 0.276822 1.000000 0.615611 0.588918 0.390942 0.499193 0.508938 0.493825 0.048762 -0.017330 -0.035058 -0.023705 -0.037148 0.017535
Shared_interests 0.270565 0.615611 1.000000 0.487241 0.476955 0.400509 0.398810 0.434406 0.064319 0.007137 0.006335 -0.048842 0.008341 0.047083
Attractivity 0.263631 0.588918 0.487241 1.000000 0.278448 0.387448 0.404159 0.355826 0.042082 0.012357 -0.045859 -0.046805 -0.016335 0.021721
Probability 0.259991 0.390942 0.476955 0.278448 1.000000 0.280188 0.332913 0.282387 0.020695 -0.010177 -0.002334 -0.058198 -0.071314 0.019578
Intelligence 0.173844 0.499193 0.400509 0.387448 0.280188 1.000000 0.669162 0.627755 0.074275 -0.017792 0.037805 0.007115 -0.080771 0.047087
Sincerety 0.167161 0.508938 0.398810 0.404159 0.332913 0.669162 1.000000 0.463694 0.071973 -0.012146 0.004885 0.037386 -0.113905 0.013696
Ambition 0.145196 0.493825 0.434406 0.355826 0.282387 0.627755 0.463694 1.000000 0.033710 -0.038762 0.021913 -0.013042 -0.062048 0.029437
Calls 0.128427 0.048762 0.064319 0.042082 0.020695 0.074275 0.071973 0.033710 1.000000 0.070623 0.041661 -0.061535 0.025861 0.043113
Age -0.035454 -0.017330 0.007137 0.012357 -0.010177 -0.017792 -0.012146 -0.038762 0.070623 1.000000 0.110096 -0.153827 0.011578 0.076644
Age_opposite -0.029092 -0.035058 0.006335 -0.045859 -0.002334 0.037805 0.004885 0.021913 0.041661 0.110096 1.000000 -0.008329 -0.002657 0.088528
Importance_same_race -0.055506 -0.023705 -0.048842 -0.046805 -0.058198 0.007115 0.037386 -0.013042 -0.061535 -0.153827 -0.008329 1.000000 0.037849 -0.087647
Order -0.029916 -0.037148 0.008341 -0.016335 -0.071314 -0.080771 -0.113905 -0.062048 0.025861 0.011578 -0.002657 0.037849 1.000000 0.021633
Interests_correlation 0.029220 0.017535 0.047083 0.021721 0.019578 0.047087 0.013696 0.029437 0.043113 0.076644 0.088528 -0.087647 0.021633 1.000000

Model¶

Select model¶

By default, LogisticRegressionCV performs 5-fold cross-validation with Stratified K-Folds, so there is no need for a separate validation split.
See: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
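
For reference, these defaults correspond roughly to the following explicit call (a sketch; below we simply rely on the defaults):

from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

# Sketch of the defaults: 10 candidate values for C, 5-fold stratified CV, lbfgs solver.
clf = LogisticRegressionCV(Cs=10, cv=StratifiedKFold(n_splits=5), solver="lbfgs", max_iter=100)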

In [ ]:
# select the logistic regression model
clf = LogisticRegressionCV()

Fit model¶

In [ ]:
# fit model to data
clf.fit(X_train, y_train)
c:\Users\Sscho\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
c:\Users\Sscho\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
c:\Users\Sscho\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
c:\Users\Sscho\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
c:\Users\Sscho\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
c:\Users\Sscho\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[ ]:
LogisticRegressionCV()
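
The convergence warnings above suggest increasing max_iter or scaling the features. One possible remedy (a sketch, not the model used in the rest of this analysis) would be to standardize the features in a pipeline before fitting the logistic regression:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch: standardizing the features usually lets lbfgs converge within the iteration limit.
clf_scaled = make_pipeline(StandardScaler(), LogisticRegressionCV(max_iter=1000))
clf_scaled.fit(X_train, y_train)
clf_scaled.score(X_test, y_test)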

Evaluation on test set¶

In [ ]:
clf.score(X_test, y_test)
Out[ ]:
0.823876197494473

We can see that attractivity has the largest positive coefficient, followed by the interests correlation, humor and the perceived probability that the other person will say yes.

In [ ]:
res = pd.DataFrame(np.row_stack(list(X_test.columns)))
res = res.assign(coef=clf.coef_.tolist()[0])
res.sort_values('coef',  ascending=False,inplace=True)
res
Out[ ]:
0 coef
7 Attractivity 0.263401
16 Interests_correlation 0.232684
5 Humor 0.223143
11 Probability 0.220484
12 Calls 0.144159
6 Shared_interests 0.108736
8 Intelligence 0.070220
3 Race_opposite 0.061754
14 Age_opposite -0.008482
15 Order -0.011403
13 Age -0.045691
4 Importance_same_race -0.046956
2 Race -0.047146
9 Sincerety -0.065626
1 Same_race -0.065818
10 Ambition -0.149546
0 Gender -0.153839
In [ ]:
sns.barplot(res,y=0,x='coef').set(title="Coefficient per Attribute")
Out[ ]:
[Text(0.5, 1.0, 'Coefficient per Attribute')]

According to the confusion matrix we have a lot of true negatives and only a few true positives.

In [ ]:
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test);

This may be caused by the class imbalance in the dataset, where there are a lot more 'no match' labels (1,118) than 'match' labels (239).

In [ ]:
num_no_match = 0
num_match = 0

for value in y_test:
    if value == 0:
        num_no_match += 1
    elif value == 1:
        num_match += 1

print(f"Number of no match: {num_no_match}")
print(f"Number of match: {num_match}")
Number of no match: 1118
Number of match: 239

This gets even clearer when looking at the precision, recall and F1 scores.

Our model is good at predicting 'no match' (because most of the data is a no match), but very bad at predicting a match (only 51% precision with an F1 score of 0.22).

In [ ]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred, target_names=['No match', 'Match']))
              precision    recall  f1-score   support

    No match       0.84      0.97      0.90      1118
       Match       0.51      0.14      0.22       239

    accuracy                           0.82      1357
   macro avg       0.67      0.55      0.56      1357
weighted avg       0.78      0.82      0.78      1357

We can see that the bad performance is not a problem of new data (the test set), because the model performs similarly on the training data.

In [ ]:
y_pred_t = clf.predict(X_train)

print(classification_report(y_train, y_pred_t, target_names=['No match', 'Match']))
              precision    recall  f1-score   support

    No match       0.85      0.98      0.91      4499
       Match       0.56      0.15      0.24       928

    accuracy                           0.83      5427
   macro avg       0.71      0.56      0.57      5427
weighted avg       0.80      0.83      0.79      5427

Based on the ROC curve we may have some room for improvement with our current model.

In [ ]:
RocCurveDisplay.from_estimator(clf, X_test, y_test)
In [ ]:
y_score = clf.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_score)
Out[ ]:
0.791012043323029
In [ ]:
# Preview the predicted labels when lowering the decision threshold to 0.21
(y_score > 0.21).astype(int)
Out[ ]:
array([1, 1, 0, ..., 1, 0, 0])

We can see that with high precision, recall is very low, and with high recall, precision is very low. We need to maximize both values.

In [ ]:
precision, recall, thresholds = precision_recall_curve(y_test, y_score)
disp = PrecisionRecallDisplay(precision=precision, recall=recall, estimator_name='LogisticRegressionCV')
disp.plot()
Out[ ]:
<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x1db9653b910>

Try with different thresholds¶

In [ ]:
pred_proba = clf.predict_proba(X_test)

Precision and recall recap:¶

For precision, we use the right column of the confusion matrix (the predicted positives): 144 / (144 + 232) = 0.38. This is the proportion of positive predictions that are actually correct.
For recall, we use the bottom row of the confusion matrix (the actual positives): 144 / (144 + 95) = 0.60. This is the proportion of actual positive instances that the classifier was able to identify.

A threshold of 0.21 maximises the F1 score for a match, but the precision of calling a true match drops to 38%.
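
The 0.21 value can be derived from the precision-recall curve by evaluating the F1 score at every candidate threshold (a sketch, reusing the precision, recall and thresholds arrays from the earlier precision_recall_curve call):

# Sketch: find the threshold that maximizes the F1 score for the match class.
f1_scores = 2 * precision * recall / (precision + recall + 1e-12)
best_idx = f1_scores[:-1].argmax()  # the last precision/recall pair has no threshold
print(thresholds[best_idx], f1_scores[best_idx])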

In [ ]:
df_02 = pd.DataFrame({'y_pred': pred_proba[:,1] > .21})

ConfusionMatrixDisplay.from_predictions(y_test, df_02['y_pred']);

print(classification_report(y_test, df_02['y_pred']))
              precision    recall  f1-score   support

           0       0.90      0.79      0.84      1118
           1       0.38      0.59      0.46       239

    accuracy                           0.76      1357
   macro avg       0.64      0.69      0.65      1357
weighted avg       0.81      0.76      0.78      1357

If we want to get the match precision over 50%, we only find 8 true matches. Such a model would not be useful.

In [ ]:
df_07 = pd.DataFrame({'y_pred': pred_proba[:,1] > .7})

ConfusionMatrixDisplay.from_predictions(y_test, df_07['y_pred']);

print(classification_report(y_test, df_07['y_pred']))
              precision    recall  f1-score   support

           0       0.83      1.00      0.90      1118
           1       0.62      0.03      0.06       239

    accuracy                           0.83      1357
   macro avg       0.72      0.51      0.48      1357
weighted avg       0.79      0.83      0.76      1357

Conclusion:¶

The dataset is imbalanced and contains far more non-matches than matches.

Use a more balanced dataset¶

Let's try to use more balanced data by taking 1,000 match and 1,000 no-match observations.

In [ ]:
df_new = pd.concat([df[df[y_label] == 0][:1000], df[df[y_label] == 1][:1000]])

alt.Chart(df_new).mark_bar().encode(
    alt.X(y_label),
    alt.Y('count()'),
    alt.Color(y_label),
    tooltip = ['count()']
).properties(
    width=250,
    height=150
).interactive().properties(title="1000 match and no-match data")
c:\Users\Alexander\anaconda3\envs\stats\lib\site-packages\altair\utils\core.py:317: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for col_name, dtype in df.dtypes.iteritems():
Out[ ]:
In [ ]:
X = df_new[features]
y = df_new[y_label]

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    shuffle=True,
                                                    random_state=42)

clf = LogisticRegressionCV()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
c:\Users\Alexander\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
c:\Users\Alexander\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
c:\Users\Alexander\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
c:\Users\Alexander\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
c:\Users\Alexander\anaconda3\envs\stats\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[ ]:
0.7625
In [ ]:
y_pred_t = clf.predict(X_train)

print(classification_report(y_train, y_pred_t, target_names=['No match', 'Match']))

RocCurveDisplay.from_estimator(clf, X_test, y_test);
              precision    recall  f1-score   support

    No match       0.76      0.74      0.75       801
       Match       0.75      0.77      0.76       799

    accuracy                           0.76      1600
   macro avg       0.76      0.76      0.76      1600
weighted avg       0.76      0.76      0.76      1600

In [ ]:
y_score = clf.predict_proba(X_test)[:, 1]
print(f"The ROC score is: {roc_auc_score(y_test, y_score)}")
The ROC score is: 0.8456461411535289
In [ ]:
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test);
In [ ]:
precision, recall, thresholds = precision_recall_curve(y_test, y_score)
disp = PrecisionRecallDisplay(precision=precision, recall=recall, estimator_name='LogisticRegressionCV')
disp.plot()
Out[ ]:
<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x1db9a3b5430>

As we can see, the model now performs much better, especially on the match class.

Save model¶

We save our model in the models/ folder, using a meaningful name and a timestamp.

In [ ]:
folder = '../models/'
pkl_filename = 'clf_match_20221222.pkl'
In [ ]:
with open(folder + pkl_filename, 'wb') as file:
    pickle.dump(clf, file)

The saved model can be loaded again like this:

In [ ]:
with open(folder + pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)

pickle_model
Out[ ]:
LogisticRegressionCV()

Conclusions¶

With the data from this survey, our statistical methods and our classification model, we can answer the following research questions:¶

  • What are the most effective characteristics to achieve a match in opposite sex speed dating?
    • According to our model parameters, the most important features for a positive correlation are attractivity, shared/correlated interests, humor and that the other person also shows some interest. On the other hand, there is a negative correlation for being of the same gender and race and for being (maybe too) ambitious and sincere.
  • What type of persons have the best chances to achieve a match regarding their characteristics?
    • People that have a rating of around 8 for all personal attributes, especially attractivity and humor.

To answer our research question, we defined the following sub-questions to strengthen our main research question:

  • Do specific characteristics affect the match selection of the survey participants?
    • Yes, see above.
  • Do these specific characteristics occur in both sexes?
    • Yes, there is no significant difference.
  • What type of persons participate in speed dating events?
    • The speed dating events took place at a university, but apart from that, there are no significant differences in the characteristics that would let us form distinct types.
  • After three weeks, how many contacts did each type of person have?
    • We can't distinguish different types of persons, but overall 2/3 of the participants had at least one call.
  • Is there a significant difference between the number of men calling women or women calling men after three weeks?
    • Yes, men called women roughly four times as often.

The following hypotheses support our research question:

Null hypothesis:

  • Having specific characteristics has no effect on the match selection of the survey participants
    • This does not hold true; the relevant characteristics are described above.
  • There is no correlation between shared interests, attributes and getting a match
    • This does not hold true; the relevant characteristics are described above.

Hypotheses:

  • Survey participants who share the specific characteristic same race and have opposite genders tend to achieve more matches
    • We did not investigate this in detail, but we saw a slightly negative correlation between same race and match.
  • Survey participants with a higher income tend to achieve more matches than survey participants with a lower income
    • We did not investigate this in detail, as income was not an important feature for our model.
  • Achieving matches because of sharing specific characteristics occurs in both sexes
    • Yes, this hypothesis holds true.
  • Three weeks after the event, men called women more often than women called men
    • Yes, by roughly a factor of four.